Updates zarr-parser to use obstore list_async instead of concurrent_map #892

norlandrhagen wants to merge 21 commits into main

Conversation
virtualizarr/parsers/zarr.py (Outdated)

```python
lengths = await _concurrent_map(
    [(k,) for k in chunk_keys], zarr_array.store.getsize
)
lengths = [size_map[k] for k in chunk_keys]
```
I think we really want to work hard to avoid creating any python lists / dicts at all
instead we want obstore -> arrow -> numpy
via https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I think the hardest part of this is dealing with the logic for missing keys - arrow might return these as nulls, but the to_numpy conversion doesn't support nulls?
Any operations we do should either be as pyarrow arrays or as numpy arrays, never as python collections
virtualizarr/parsers/zarr.py (Outdated)

```python
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
async for batch in stream:
    size_map.update(
        zip(batch.column("path").to_pylist(), batch.column("size").to_pylist())
    )
```
is this zipping of pylists creating a python dict? we want to avoid that
You will also want to add a new (private for now) constructor to the …
Hmm, now hitting a kerchunk error: …
virtualizarr/manifests/manifest.py (Outdated)

```python
def _from_arrow(
    cls,
    *,
    chunk_keys: "pa.Array",
```
I don't know that you need to pass this - maybe instead we should pass arrow arrays with nulls for uninitialized chunks?
virtualizarr/parsers/zarr.py (Outdated)

```python
path_batches = []
size_batches = []
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
```
Just grabbing the underlying obstore store is an interesting idea...
Co-authored-by: Tom Nicholas <tom@earthmover.io>
This should be unit testable without using Kerchunk or Icechunk. We are simply creating the …
…ape]. Moves all weird arrow reshaping into zarr:build_chunk_manifest
Totally agree! I think... the kerchunk errors are unrelated. I added …
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   89.33%   89.11%   -0.23%
==========================================
  Files          34       33       -1
  Lines        1997     2030      +33
==========================================
+ Hits         1784     1809      +25
- Misses        213      221       +8
```
virtualizarr/manifests/manifest.py (Outdated)

```python
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)

if shape is not None:
```
What happens if shape is None? Should that even be allowed?
virtualizarr/manifests/manifest.py (Outdated)

```python
paths_np = (
    pc.if_else(pc.is_null(paths), "", paths)
    .to_numpy(zero_copy_only=False)
    .astype(np.dtypes.StringDType())
)
offsets_np = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
).to_numpy(zero_copy_only=False)
lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
```
Let's split the Arrow compute operations from the numpy conversions, if only because it makes it easier to read.
virtualizarr/parsers/zarr.py (Outdated)

```python
chunk_grid_shape = tuple(
    math.ceil(s / c) for s, c in zip(zarr_array.shape, zarr_array.chunks)
)
# scalar arrays go through the dict path instead of the pure arrow bit
```
It would be nice to not have to keep the whole old codepath around just for this special case...
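For reference, the chunk-grid arithmetic above is just a per-dimension ceiling division, and the scalar special case falls out of the empty shape. A minimal standalone sketch, not the PR's code:

```python
import math

def chunk_grid_shape(shape: tuple, chunks: tuple) -> tuple:
    # Number of chunks along each dimension: ceil(array_size / chunk_size).
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

# A (10, 10) array with (4, 4) chunks has a 3 x 3 grid of chunks.
# A scalar (zero-dimensional) array has shape (), so the grid is also (),
# which is why scalars fall outside the generic arrow codepath.
```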
virtualizarr/parsers/zarr.py (Outdated)

```python
    return ChunkManifest(chunk_map)
normalized_keys, full_paths, all_lengths = result

# Incoming: lots of LLM arrow mumbo jumbo for sparse arrays
```
there's a lot going on here that I'm suspicious could be simplified
Totally agree. I took a shot at trying to simplify it a bit. The handling of sparse arrays makes it a bit verbose.
```python
    flat_positions,
    pc.multiply(pc.cast(dim_indices, pa.int64()), dim_stride),
)
split_keys = pc.split_pattern(normalized_keys, pattern=".")
```
The chunk key encoding could also be "/" - we can probably read that from the zarr.json and use it here?
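In Zarr v3 metadata the separator lives under chunk_key_encoding. A hedged sketch of reading it from a parsed zarr.json; the inline document below is fabricated for illustration:

```python
import json

# Example zarr.json fragment; the v3 spec puts the separator at
# chunk_key_encoding.configuration.separator ("/" or ".").
zarr_json = json.loads(
    '{"zarr_format": 3, "chunk_key_encoding":'
    ' {"name": "default", "configuration": {"separator": "/"}}}'
)

separator = (
    zarr_json.get("chunk_key_encoding", {})
    .get("configuration", {})
    .get("separator", "/")  # "/" is the spec default for the "default" encoding
)
```

The result could then be passed to pc.split_pattern(normalized_keys, pattern=separator) in place of the hard-coded ".".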
Closes #891: Speed up ZarrParser using obstore and Arrow?

- Tests passing
- Full type hint coverage
- Changes are documented in docs/releases.rst

Swaps out the _concurrent_map in build_chunk_mapping with obstore's list_async. Constructs the python ChunkManifest object's numpy arrays directly from the Arrow arrays.*

*There is still a conversion to a dict, so not quite. Bonus: removes the zarr vendor code.